In the previous article, [Big Data Storage: HDFS](https://xx/Big Data Storage:HDFS), we discussed the design principles of HDFS and the architectural optimizations it has accumulated in practice. With its distributed, scalable, and fault-tolerant design, HDFS has gradually become the cornerstone of big data storage.
This article focuses on the computing aspect of [Deconstructing Big Data: Storage, Computing, and Querying](https://xx/Deconstructing Big Data:Storage, Computing, and Querying).
Big data computing can be divided into offline computing and real-time computing, where offline computing is also known as batch processing.
Here, we will first focus on batch processing, exploring its principles, architecture, frameworks, and application scenarios.
What Is Batch Processing?
Batch Processing is a big data computing method in which data is collected in batches, processed in bulk, and the results are output in a single run. It is suitable for scenarios involving large-scale data, complex computation logic, and a high tolerance for latency, such as daily or monthly report generation.
The core concepts of batch processing, illustrated in the sketch after this list, are:
- Collect the input dataset in batches
- Perform parallel computation on the data
- Aggregate the parallel computation results and output a complete result
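To make these three steps concrete, here is a minimal single-machine sketch in Python. It uses `multiprocessing` as a stand-in for a distributed cluster; the dataset, chunk size, and worker count are arbitrary choices for illustration.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Parallel computation: each worker processes one batch independently
    return sum(chunk)

if __name__ == "__main__":
    # Collect the input dataset in batches (here: fixed-size chunks)
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    # Fan the chunks out to worker processes in parallel
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)

    # Aggregate the partial results into one complete output
    print(sum(partials))
```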
Key characteristics include:
- Immutability of data – Once collected, the data will not be modified.
- Large scale – Typically processes data collected over a day, a month, or even a year.
- One-time execution – Each batch job runs once per cycle, at its scheduled time.
- High accuracy – With immutable and complete data, results tend to be more accurate.
- High throughput – Distributed computing can efficiently utilize cluster resources.
Batch Processing Architecture
Following these core principles, a batch processing system typically consists of several layers (a minimal sketch of how they compose in code follows the list):
- Data Source – The origin of the data, usually generated by business systems; may include database records or event tracking logs.
- Ingestion Layer & Storage – Transfers data from sources into the batch processing system for storage, often through storage-specific SDKs or APIs. Common storage media include HDFS and Amazon S3.
- Compute Engine – The core component that loads, filters, and transforms data.
- Resource Management & Scheduling – Since compute engines typically run in distributed environments, resource management and job scheduling are needed. YARN is commonly used for this purpose.
- Output Layer & Storage – Writes computation results to a target storage system via SDK or API, similar to the ingestion process.
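Here is a minimal PySpark sketch of how these layers compose, assuming hypothetical HDFS paths and a hypothetical `status` field. Note that resource management is chosen at submission time (e.g. `spark-submit --master yarn`), not in the job code itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchLayers").getOrCreate()

# Ingestion layer & storage: load raw events previously landed on HDFS
events = spark.read.json("hdfs:///raw/events/dt=2024-01-01")  # hypothetical path

# Compute engine: filter and transform the data
daily = events.filter(events.status == "ok").groupBy("user_id").count()

# Output layer & storage: write the results to the target storage
daily.write.mode("overwrite").parquet("hdfs:///dw/daily_counts/dt=2024-01-01")

spark.stop()
```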
Batch Processing Frameworks
Two widely used batch processing frameworks are MapReduce and Spark.
- Hadoop MapReduce – The open-source implementation of Google's MapReduce paper, and one of the earliest distributed batch processing frameworks. It consists of two core phases: `Map` and `Reduce`.
- Spark – Evolved from the MapReduce model by introducing in-memory computing and DAG (Directed Acyclic Graph) task scheduling, enabling more complex computations.
The MapReduce programming model splits a job into two main phases (a toy word-count sketch follows the list):
- Map – Transforms input data into a set of key/value pairs.
- Reduce – Aggregates values associated with the same key and outputs the final result.
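As a toy illustration of the model, here is the classic word count with plain Python standing in for a distributed runtime; the explicit grouping step plays the role of the framework's shuffle.

```python
from collections import defaultdict

def map_phase(line):
    # Map: transform a line of input into (word, 1) key/value pairs
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce: aggregate all values that share the same key
    return key, sum(values)

lines = ["big data batch", "batch processing"]

# Shuffle: group the mapped values by key (done by the framework in practice)
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

print([reduce_phase(key, values) for key, values in groups.items()])
# [('big', 1), ('data', 1), ('batch', 2), ('processing', 1)]
```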
However, complex workloads often require chaining multiple MapReduce jobs: each job reads its input from disk and writes its output back to disk on completion, and every job in the chain pays startup and shutdown overhead.
Spark addresses these issues by using in-memory processing and DAG scheduling, allowing multiple MapReduce-like operations to be executed within a single Spark job.
Spark job execution flow (a minimal PySpark sketch follows the list):
- Job Submission – Submitted via Spark SDK or command line.
- DAG Construction – Built based on Spark operators.
- Job Partitioning & Scheduling – The DAG is split into stages for scheduling.
- Data Processing – Data shuffling occurs only when global aggregation or join operations are needed.
- Output Results – Computation results are written to the target storage.
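To connect these steps to code, here is a minimal PySpark word count; the HDFS paths are hypothetical.

```python
from pyspark.sql import SparkSession

# Job submission: creating a session registers the application with the cluster
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# DAG construction: each operator adds a node; nothing runs yet (lazy evaluation)
counts = (
    spark.sparkContext.textFile("hdfs:///data/input")      # hypothetical path
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)  # the shuffle here marks a stage boundary
)

# Output: this action splits the DAG into stages, schedules the tasks,
# and writes the results to the target storage
counts.saveAsTextFile("hdfs:///data/output")               # hypothetical path

spark.stop()
```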
Application Scenarios
Batch processing still holds a critical position in enterprise data architectures. Common use cases include:
- Offline Data Warehouse Construction – Using Hive to clean and aggregate raw data into wide tables for business analytics (a sketch follows this list).
- Offline Data Analysis – Performing statistical analysis across various dimensions in a data warehouse.
- Machine Learning Model Training – Processing historical data in bulk to produce datasets for model training.
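As an illustration of the first scenario, here is a minimal PySpark sketch that builds a daily wide table through Hive; it assumes a configured Hive metastore, and all database, table, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("DailyWideTable")
         .enableHiveSupport()      # requires a configured Hive metastore
         .getOrCreate())

# Clean and aggregate raw (ODS) data into one partition of a wide table
spark.sql("""
    INSERT OVERWRITE TABLE dw.user_wide_daily PARTITION (dt = '2024-01-01')
    SELECT u.user_id,
           u.region,
           COUNT(o.order_id) AS order_cnt,
           SUM(o.amount)     AS order_amount
    FROM ods.users u
    LEFT JOIN ods.orders o
           ON o.user_id = u.user_id AND o.dt = '2024-01-01'
    GROUP BY u.user_id, u.region
""")

spark.stop()
```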
Limitations & Challenges
Despite its importance, batch processing faces several challenges:
- High Latency – Results are available only after the entire batch completes, making it unsuitable for low-latency use cases.
- High Resource Usage – Large datasets require significant computation resources.
- Data Skew – Hot keys can accumulate a disproportionate share of the data, leading to uneven workload distribution (a common mitigation, key salting, is sketched after this list).
- High Maintenance Costs – Failed jobs often require a full rerun, consuming time and resources.
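Key salting is one common way to work around data skew: spread a hot key across several random sub-keys, aggregate the balanced partial results, then strip the salt and aggregate again. A minimal PySpark sketch, with an artificially skewed dataset and an arbitrary bucket count:

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SaltingDemo").getOrCreate()
sc = spark.sparkContext

# An artificially skewed dataset: one hot key dominates
pairs = sc.parallelize([("hot", 1)] * 1000 + [("cold", 1)] * 10)

N = 8  # number of salt buckets (tune per workload)

counts = (
    pairs
    .map(lambda kv: (f"{kv[0]}#{random.randrange(N)}", kv[1]))  # salt the key
    .reduceByKey(lambda a, b: a + b)   # balanced partial aggregation
    .map(lambda kv: (kv[0].split("#")[0], kv[1]))               # strip the salt
    .reduceByKey(lambda a, b: a + b)   # small final aggregation
)

print(counts.collect())  # [('hot', 1000), ('cold', 10)] in some order
spark.stop()
```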
Conclusion
In the big data ecosystem, computing is a crucial link between data storage and application. Batch processing remains the backbone for offline data warehouses, analytics, and machine learning training workloads.
While real-time processing is gaining momentum, batch processing is still irreplaceable.
In the next article, we will dive into big data real-time processing, exploring how it compensates for batch processing’s shortcomings and the unique challenges it faces.